An analysis on Frequency of terms for Text Categorization
Authors
Abstract
Preliminary results on a method for reducing the number of terms used in text categorization are presented. We use the transition point, a frequency that splits the words of a text into high-frequency and low-frequency words. Thresholds derived from the document frequency of terms, Information Gain, and χ² were tested in combination with the transition point. A text categorization experiment based on Rocchio’s method showed that selecting terms whose frequency is lower than the transition point discarded noise terms without degrading categorization performance. In our experiment, the best result was obtained by term selection based on a document frequency threshold combined with the transition point as a cutoff.
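The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract gives no formula for the transition point, so the sketch assumes the commonly cited closed form TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms that occur exactly once. The function names, the `min_df` parameter, and the way the document frequency threshold is combined with the cut are all hypothetical.

```python
from collections import Counter
from math import sqrt


def transition_point(term_freq):
    """Estimate the transition point from a term -> frequency mapping.

    Assumption: TP = (sqrt(8 * I1 + 1) - 1) / 2, with I1 the number of
    terms whose frequency is exactly 1. The abstract does not state this.
    """
    i1 = sum(1 for count in term_freq.values() if count == 1)
    return (sqrt(8 * i1 + 1) - 1) / 2


def select_terms(docs, min_df=1):
    """Keep terms below the transition point that meet a document
    frequency threshold.

    docs   : list of tokenized documents (lists of strings)
    min_df : hypothetical minimum number of documents containing the term
    """
    # Corpus-wide term frequency and per-term document frequency.
    term_freq = Counter(tok for doc in docs for tok in doc)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    tp = transition_point(term_freq)
    return {
        term
        for term, freq in term_freq.items()
        if freq < tp and doc_freq[term] >= min_df
    }


# Toy corpus: "a" occurs 4 times, "b" twice, six terms occur once.
docs = [
    ["a", "a", "a", "b", "c", "d"],
    ["a", "b", "e", "f"],
    ["g", "h"],
]
selected = select_terms(docs, min_df=2)
```

In this toy corpus six terms occur once, so the assumed formula gives a transition point of 3.0. The high-frequency term "a" is cut, and with `min_df=2` only "b" survives, since it is the only remaining term found in two documents.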
Similar resources
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Cluster Based Hybrid Niche Mimetic and Genetic Algorithm for Text Document Categorization
An efficient cluster based hybrid niche mimetic and genetic algorithm for text document categorization to improve the retrieval rate of relevant document fetching is addressed. The proposal minimizes the processing of structuring the document with better feature selection using hybrid algorithm. In addition restructuring of feature words to associated documents gets reduced, in turn increases d...
Using Class Frequency for Improving Centroid-based Text Classification
Most previous work on text classification represented the importance of terms by term occurrence frequency (tf) and inverse document frequency (idf). This paper presents ways to apply class frequency in centroid-based text categorization. Three approaches are taken into account. The first one is to explore the effectiveness of inverse class frequency on the popular term weighting, i.e., TFIDF...
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...
Domain Keyword Extraction Technique: a New Weighting Method Based on Frequency Analysis
On-line text documents rapidly increase in size with the growth of World Wide Web. To manage such a huge amount of texts, several text mining applications came into existence. Those applications such as search engine, text categorization, summarization, and topic detection are based on feature extraction. It is extremely time consuming and difficult task to extract keyword or feature manually. So a...
Journal title:
- Procesamiento del Lenguaje Natural
Volume 33, Issue
Pages -
Publication date 2004